library(data.table)      # data.table type
library(tidyr)           # gather()
library(kableExtra)      # kables for html document
library(naniar)          # vis_miss()
library(corrplot)        # corrplot()
library(caret)           # findCorrelation()
library(REdaS)           # KMOS(), bart_spher()
library(psych)           # KMO(), fa(), principal()
library(ggplot2)         # ggplot()

options(scipen = 999)
# library(pastecs)
# library(dplyr)
# library(rcompanion) 
# library(GPArotation)
# library(nFactors)
# library(knitr)

We start by loading both data files.

descriptions <- fread("Data File_Case Study_Factor Analysis_Variables.csv")
df <- read.csv("Data File_Case_Study_Factor Analysis_MD.csv")

We select only the relevant columns which correspond to question numbers qd1 to qd33 inside df.

df1 <- df[,c(9:41)]

We check for missing values.

vis_miss(df1)

Since questions 22, 26, 28 and 29 contain missing values, we delete the corresponding rows.

df1 <- na.omit(df1)
vis_miss(df1)

Now, our data frame df1 has no more missing values.

1 Orthogonal Factor Analysis

We proceed with this analysis assuming independence between the factors we will attempt to extract.

1.1 Inspect the correlation matrix.

We compute the correlation matrix and plot it.

raqMatrix <- cor(df1)
corrplot(raqMatrix, tl.cex = 0.8)

We can see some strong correlations between different questions, indicated by bigger and darker blue dots. Furthermore, we notice that qd4 and qd23 seem to be the only two items with low correlations with all the other items: they are standalone items.

A strong correlation between two variables is most likely a signal of redundancy. The correlation plot above shows no very clear patterns, but we can observe that some variables are highly correlated with one another, which is good since we want to perform a factor analysis. More precisely, we are interested in the underlying factors causing those variables to move together. A weaker correlation indicates that different factors affect the variables, meaning they do not describe the same domain.

Listed below are the pairs of variables displaying higher correlation, with a selected threshold of 0.65.

n = nrow(raqMatrix)
for (i in 1:(n-1)) {
  for (j in (i+1):n) {
    if (raqMatrix[i, j] >= 0.65) {
      print(paste(colnames(df1)[i],
                  "and",
                  colnames(df1)[j],
                  "have a correlation of",
                  round(raqMatrix[i, j], 2)))
    }
  }
}
## [1] "qd1 and qd10 have a correlation of 0.84"
## [1] "qd1 and qd20 have a correlation of 0.84"
## [1] "qd1 and qd27 have a correlation of 0.84"
## [1] "qd2 and qd5 have a correlation of 0.76"
## [1] "qd2 and qd7 have a correlation of 0.82"
## [1] "qd2 and qd12 have a correlation of 0.81"
## [1] "qd2 and qd16 have a correlation of 0.69"
## [1] "qd3 and qd11 have a correlation of 0.76"
## [1] "qd3 and qd13 have a correlation of 0.76"
## [1] "qd3 and qd30 have a correlation of 0.74"
## [1] "qd5 and qd7 have a correlation of 0.76"
## [1] "qd5 and qd12 have a correlation of 0.73"
## [1] "qd5 and qd16 have a correlation of 0.71"
## [1] "qd6 and qd8 have a correlation of 0.74"
## [1] "qd6 and qd18 have a correlation of 0.73"
## [1] "qd6 and qd25 have a correlation of 0.81"
## [1] "qd7 and qd12 have a correlation of 0.81"
## [1] "qd7 and qd16 have a correlation of 0.73"
## [1] "qd8 and qd25 have a correlation of 0.72"
## [1] "qd9 and qd14 have a correlation of 0.83"
## [1] "qd9 and qd19 have a correlation of 0.88"
## [1] "qd9 and qd21 have a correlation of 0.79"
## [1] "qd9 and qd24 have a correlation of 0.85"
## [1] "qd10 and qd20 have a correlation of 0.83"
## [1] "qd10 and qd27 have a correlation of 0.83"
## [1] "qd11 and qd13 have a correlation of 0.74"
## [1] "qd11 and qd30 have a correlation of 0.7"
## [1] "qd12 and qd16 have a correlation of 0.69"
## [1] "qd13 and qd30 have a correlation of 0.83"
## [1] "qd14 and qd19 have a correlation of 0.84"
## [1] "qd14 and qd21 have a correlation of 0.82"
## [1] "qd14 and qd24 have a correlation of 0.87"
## [1] "qd15 and qd17 have a correlation of 0.86"
## [1] "qd15 and qd32 have a correlation of 0.83"
## [1] "qd17 and qd32 have a correlation of 0.84"
## [1] "qd18 and qd25 have a correlation of 0.7"
## [1] "qd19 and qd21 have a correlation of 0.79"
## [1] "qd19 and qd24 have a correlation of 0.85"
## [1] "qd20 and qd27 have a correlation of 0.84"
## [1] "qd21 and qd24 have a correlation of 0.84"
## [1] "qd22 and qd29 have a correlation of 0.84"
## [1] "qd22 and qd33 have a correlation of 0.81"
## [1] "qd26 and qd28 have a correlation of 0.66"
## [1] "qd26 and qd31 have a correlation of 0.67"
## [1] "qd28 and qd31 have a correlation of 0.68"
## [1] "qd29 and qd33 have a correlation of 0.79"
rm(i, j, n)
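caret, loaded in the preamble for findCorrelation(), offers a packaged alternative to this double loop: it flags columns whose removal would leave no pairwise correlation above a cutoff. A minimal sketch on simulated data (variable names are hypothetical); on our matrix the analogous call would be findCorrelation(raqMatrix, cutoff = 0.65), though here we only wanted to inspect the pairs, not drop them.

```r
library(caret)

set.seed(9)
base <- rnorm(100)
X <- cbind(a = base + rnorm(100, sd = 0.1),  # a and b are near-duplicates
           b = base + rnorm(100, sd = 0.1),
           c = rnorm(100))                   # c is independent noise

# indices of columns whose removal leaves no pairwise correlation above 0.65
drop_idx <- findCorrelation(cor(X), cutoff = 0.65)
colnames(X)[drop_idx]                        # flags one of the redundant pair
```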

This allows us to separate the questionnaire items into clusters by grouping together variables that are highly correlated. We can obtain exactly the same groups by using the hclust order when plotting the correlation matrix: rows and columns are reordered so that highly correlated items sit side by side.

corrplot(raqMatrix, order = 'hclust', tl.cex = 0.8, addrect = 10)

The clustering of the correlation matrix with 10 clusters and our grouping with a threshold of 0.65 give the same results. We get the following 10 groups:

  • Items 1, 10, 20 and 27, all related to aesthetics;
  • Items 2, 5, 7, 12 and 16, all related to performance;
  • Items 3, 11, 13 and 30, all related to ease of use;
  • Items 6, 8, 18 and 25, all related to features;
  • Items 9, 14, 19, 21 and 24, all related to serviceability;
  • Items 15, 17 and 32, all related to quality of materials;
  • Items 22, 29 and 33, all related to flawlessness;
  • Items 26, 28 and 31, all related to durability;
  • Item 4, related to distinctiveness;
  • Item 23, related to prestige.

This grouping is quite close to the quality dimensions defined in the literature (see Table 1 of the assignment pdf). The differences are the presence of a construct about the quality of materials, which is not listed among the dimensions of quality, and the fact that questions 4 and 23 should reflect the same underlying construct, “Distinctiveness/Prestige”, but do not seem to do so in the experimental data, as the two variables are very weakly correlated.
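The hclust ordering used above can also be reproduced with base R alone, by clustering variables on a 1 - |correlation| distance. A minimal sketch on simulated data (the variable names and three-group structure are hypothetical):

```r
# Group variables by hierarchical clustering of the correlation matrix,
# as corrplot's 'hclust' order does. Simulated data, hypothetical names.
set.seed(3)
latent <- matrix(rnorm(300), nrow = 100, ncol = 3)            # 3 latent factors
X <- latent[, rep(1:3, each = 2)] + matrix(rnorm(600, sd = 0.3), 100, 6)
colnames(X) <- paste0("q", 1:6)

d  <- as.dist(1 - abs(cor(X)))   # small distance = strongly correlated pair
cl <- cutree(hclust(d), k = 3)   # cut the dendrogram into 3 groups
split(colnames(X), cl)           # variables grouped by underlying factor
```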

1.2 Check whether the data set and all of its variables are suitable for factor analysis.

To make sure that our data is suitable for factor analysis, we want to test the extent to which our variables are correlated with one another. To further the analysis we measure the sampling adequacy based on the Kaiser-Meyer-Olkin criterion:

#Kaiser-Meyer-Olkin Statistics
KMO(df1)
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = df1)
## Overall MSA =  0.96
## MSA for each item = 
##  qd1  qd2  qd3  qd4  qd5  qd6  qd7  qd8  qd9 qd10 qd11 qd12 qd13 qd14 qd15 qd16 
## 0.96 0.97 0.96 0.64 0.98 0.96 0.97 0.98 0.96 0.97 0.96 0.97 0.96 0.96 0.95 0.98 
## qd17 qd18 qd19 qd20 qd21 qd22 qd23 qd24 qd25 qd26 qd27 qd28 qd29 qd30 qd31 qd32 
## 0.95 0.97 0.96 0.97 0.97 0.93 0.90 0.96 0.96 0.97 0.97 0.96 0.93 0.97 0.96 0.96 
## qd33 
## 0.95

The KMO test measures the proportion of variance between variables which can be attributed to common variance. It takes values between 0 and 1; higher values indicate that a variable is more suitable for factor analysis. The measure is computed with the following formula: \[KMO_j=\frac{\sum_{i\neq j}r^2_{ij}}{\sum_{i \neq j} r^2_{ij}+\sum_{i\neq j}p^2_{ij}}\] where \(r_{ij}\) is the correlation coefficient of the variables indexed by \(i\) and \(j\), and \(p_{ij}\) is the partial correlation coefficient of those variables.

Our overall KMO value amounts to 0.96, which indicates that on average the variables are well suited for factor analysis. The individual MSA values of all variables except qd4 are above 0.9, which corresponds to “marvelous” in Kaiser’s nomenclature. The value of qd4 is considered “mediocre”. We must look out for it in the next steps.
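The formula above can be reproduced by hand: the partial correlations come from the inverse of the correlation matrix. A minimal sketch on simulated data, computing only the overall MSA and checking it against psych::KMO():

```r
library(psych)

set.seed(1)
X <- matrix(rnorm(500), 100, 5) %*% chol(toeplitz(0.5^(0:4)))  # correlated data
R <- cor(X)

# partial correlations from the inverse of the correlation matrix
Rinv <- solve(R)
P <- -Rinv / sqrt(diag(Rinv) %o% diag(Rinv))

off <- upper.tri(R)
kmo_overall <- sum(R[off]^2) / (sum(R[off]^2) + sum(P[off]^2))
all.equal(kmo_overall, KMO(R)$MSA)   # same value as the packaged overall MSA
```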

We will also take a look at the diagonal of the anti-image correlation matrix. This gives us the proportion of a variable’s variance that cannot be explained by any other variable.

# we use the KMO function from the psych package to retrieve the anti-image cor mat
anti_mat_diag = data.frame("Question" = 1:33,
                           "UniqueVar" = diag(KMO(df1)$ImCov))

ggplot(data = anti_mat_diag, aes(x = Question, y = UniqueVar)) +
  geom_col() +
  theme_classic() +
  scale_x_continuous(breaks = seq(1, 33, 1)) +
  scale_y_continuous(limits = c(0,1)) +
  labs(title = "Diagonal values of anti-image correlation matrix",
       y = "Proportion of unique variance")

rm(anti_mat_diag)

In our case, qd4 and qd23 have values above 0.5, indicating that their topics have not been sufficiently covered in the questionnaire. They clearly stand out as outliers compared to all the other variables.

Another way to judge whether our data is suitable for factor analysis is Bartlett’s test. The null hypothesis of this test is that the sample originates from a population in which all variables are uncorrelated, or in other words that the correlation matrix equals the identity matrix. The test is better suited to cases with fewer than 5 observations per variable, because it is very sensitive in large samples and thus might not be a good fit for us.

The test result is highly significant, meaning we can reject the null hypothesis: the data is fit for factor analysis.

bart_spher(df1)
##  Bartlett's Test of Sphericity
## 
## Call: bart_spher(x = df1)
## 
##      X2 = 30857.911
##      df = 528
## p-value < 2.22e-16
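The statistic itself is easy to reproduce by hand. A minimal sketch on simulated data, checked against psych::cortest.bartlett(), an equivalent implementation of the same test:

```r
library(psych)

set.seed(7)
X <- matrix(rnorm(500), 100, 5)      # 100 observations of 5 independent variables
R <- cor(X); n <- nrow(X); p <- ncol(X)

# X2 = -((n - 1) - (2p + 5) / 6) * log(det(R)), with p(p - 1)/2 degrees of freedom
chisq <- -((n - 1) - (2 * p + 5) / 6) * log(det(R))
df    <- p * (p - 1) / 2

ref <- cortest.bartlett(R, n = n)
all.equal(chisq, ref$chisq)          # matches the packaged statistic
```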

After having considered all the above elements, this data set looks suitable for factor analysis. The only two items we need to pay special attention to are qd4 and qd23, as they might not be as well suited as the rest.

1.3 Run principal axis factoring (PAF) and use varimax rotation.

PAF is an exploratory factor analysis method (or factor extraction method): it attempts to find the smallest number of factors accounting for the common variance of a set of variables. It uses a reduced correlation matrix, whose diagonal contains communalities.

It assumes that the variance of the observed variables cannot be fully explained by the extracted factors: it decomposes the variance into communality and unique variance, without separating specific and error variability.
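The reduced correlation matrix can be made concrete: the ones on the diagonal are replaced by initial communality estimates, commonly the squared multiple correlations (psych::smc()). A minimal sketch on simulated data:

```r
library(psych)

set.seed(11)
X <- matrix(rnorm(600), 100, 6) %*% chol(toeplitz(0.6^(0:5)))  # correlated data
R <- cor(X)

h2 <- smc(R)             # squared multiple correlations as initial communalities
R_reduced <- R
diag(R_reduced) <- h2    # the reduced correlation matrix that PAF factors

# PAF iterates eigendecompositions of R_reduced, updating the diagonal each time
eigen(R_reduced)$values
```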

PAF is implemented with the fa function from psych package.

We use the “varimax” rotation, which is designed to simplify the interpretation of the loadings. Varimax maximizes the sum, over factors, of the variance of the squared factor loadings. It pushes each loading towards 1 or 0, which facilitates interpretation.
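This criterion can be checked with base R’s stats::varimax() on an arbitrary (hypothetical) loading matrix; with normalize = FALSE the function maximizes exactly the raw criterion described above:

```r
# varimax criterion: sum over factors of the variance of the squared loadings
varimax_crit <- function(L) sum(apply(L^2, 2, var))

set.seed(2)
L <- matrix(runif(12, -1, 1), nrow = 6, ncol = 2)   # hypothetical loadings
rot <- varimax(L, normalize = FALSE)$loadings

varimax_crit(rot) >= varimax_crit(L)   # rotation cannot decrease the criterion
```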

The number of underlying factors has to be specified inside the function. If we try to input a number of factors greater than 13, fa returns an error. Thus, we fit models with 1 to 13 factors and record, for each, several criteria aimed at helping select the best model.

These criteria are:

  • fit, which measures how well the factor model reproduces the correlation matrix;
  • objective, which is the value of the function that is minimized by a maximum likelihood procedure;
  • rms, which is the square root of the mean squared off-diagonal residual;
  • crms, which is the rms adjusted for degrees of freedom;
  • TLI, which is the Tucker Lewis Index of factoring reliability;
  • BIC, which is the Bayesian information criterion.

A good model exhibits high values of fit and TLI, and low values of the others.

# we initialize an empty data frame which will record our criteria values
fit_df <- matrix(nrow = 13, ncol = 6)
colnames(fit_df) <- c("fit", "objective", "rms", "crms", "TLI", "BIC")
fit_df <- as.data.frame(fit_df)

# we compute the factor analysis for nfactors between 1 and 13 and record criteria results in fit_df
for (i in 1:13) {
  FA <- fa(df1, nfactors = i, rotate = "varimax", fm = "pa")
  fit_df[i, 1] <- FA$fit
  fit_df[i, 2] <- FA$objective
  fit_df[i, 3] <- FA$rms
  fit_df[i, 4] <- FA$crms
  fit_df[i, 5] <- FA$TLI
  fit_df[i, 6] <- FA$BIC
}

rm(i, FA)

Now, we can plot the values of each criterion with respect to the number of factors.

fit_df %>% gather() %>%                 
  ggplot(aes(x = rep(1:13, ncol(fit_df)), y = value)) +
  facet_wrap(~ key, scales = "free") +
  geom_point() +
  theme_classic() +
  scale_x_continuous(breaks = seq(1, 13, 1)) +
  geom_vline(xintercept = 8, linetype = "dashed") +
  geom_vline(xintercept = 9, linetype = "dashed") +
  labs(title = "Various criteria for Principal Axis Factoring",
       x = "Number of factors",
       y = "")

The dashed lines designate the numbers of factors which seem the most appropriate for our data. The improvement in the criteria beyond 9 factors is much slower and less significant. Depending on the criterion, we would choose 8 or 9 factors, so we will test both models, with and without questions 4 and 23, since our exploratory analysis indicated that those two standalone variables will probably not load onto any factor.

FA8 <- fa(df1, nfactors = 8, rotate = "varimax", fm = "pa")
FA8_423 <- fa(df1[, -c(4, 23)], nfactors = 8, rotate = "varimax", fm = "pa")
FA9 <- fa(df1, nfactors = 9, rotate = "varimax", fm = "pa")
FA9_423 <- fa(df1[, -c(4, 23)], nfactors = 9, rotate = "varimax", fm = "pa")

We examine the factor loadings of each model, using a plot similar to the correlation plot to quickly visualize them.

par(mfrow = c(2,2))
corrplot(t(FA8$loadings),
         tl.cex = 0.7,
         title = "PAF with 8 factors \n Loadings",
         mar = c(0, 1, 3, 0))
corrplot(t(FA8_423$loadings),
         tl.cex = 0.7,
         title = "PAF with 8 factors excluding qd4 and qd23 \n Loadings",
         mar = c(0, 1, 3, 0))
corrplot(t(FA9$loadings),
         tl.cex = 0.7,
         title = "PAF with 9 factors \n Loadings",
         mar = c(0, 1, 3, 0))
corrplot(t(FA9_423$loadings),
         tl.cex = 0.7,
         title = "PAF with 9 factors excluding qd4 and qd23 \n Loadings",
         mar = c(0, 1, 3, 0))

par(mfrow = c(1,1))

We remark the following:

  • In FA8, questions 4 and 23 aren’t loaded on any factor. Each other variable is clearly loaded on a unique factor.
  • In FA8_423, each variable is clearly loaded on one factor like before, since we deleted the two variables which weren’t associated with any of the 8 factors. If we decide to keep a model with 8 factors, we can delete qd4 and qd23, as they would not load onto a factor in any case.
  • In FA9, questions 4 and 23 are loaded onto one factor together and all other variables are loaded on one of the 8 other factors. The loadings of qd4 and qd23 are the lowest however, at 0.414 and 0.474 respectively.
  • In FA9_423, the ninth factor has no variable loaded on it, as we have deleted questions 4 and 23 and the other items are arranged on 8 factors. This model is not appropriate for the data and should be discarded.

In summary, there are two possible models:

  • A model with 8 factors which does not include qd4 and qd23;
  • A model with 9 factors in which qd4 and qd23 are loaded onto a factor together.

We discard FA8 and FA9_423 and display the numerical loadings greater than 0.3 of the two remaining models.

rm(FA8, FA9_423)

print(FA8_423$loadings, cutoff = 0.3, sort = TRUE)
## 
## Loadings:
##      PA2   PA1   PA5   PA4   PA3   PA7   PA6   PA8  
## qd9  0.827                                          
## qd14 0.851                                          
## qd19 0.832                                          
## qd21 0.785                                          
## qd24 0.860                                          
## qd2        0.723                                    
## qd5        0.645                                    
## qd7        0.738                                    
## qd12       0.711                                    
## qd16       0.616                                    
## qd1              0.797                              
## qd10             0.749                              
## qd20             0.772                              
## qd27             0.747                              
## qd3                    0.759                        
## qd11                   0.749                        
## qd13                   0.739                        
## qd30                   0.677                        
## qd22                         0.837                  
## qd29                         0.813                  
## qd33                         0.780                  
## qd6                                0.741            
## qd8        0.341                   0.558            
## qd18                               0.601            
## qd25                               0.762            
## qd15                                     0.749      
## qd17                                     0.744      
## qd32                                     0.715      
## qd26                                           0.620
## qd28                                           0.672
## qd31                                           0.660
## 
##                  PA2   PA1   PA5   PA4   PA3   PA7   PA6   PA8
## SS loadings    4.569 3.577 3.404 3.121 2.781 2.675 2.324 1.935
## Proportion Var 0.147 0.115 0.110 0.101 0.090 0.086 0.075 0.062
## Cumulative Var 0.147 0.263 0.373 0.473 0.563 0.649 0.724 0.787
print(FA9$loadings, cutoff = 0.3, sort = TRUE)
## 
## Loadings:
##      PA2    PA1    PA5    PA4    PA3    PA7    PA6    PA8    PA9   
## qd9   0.830                                                        
## qd14  0.853                                                        
## qd19  0.834                                                        
## qd21  0.788                                                        
## qd24  0.862                                                        
## qd2          0.727                                                 
## qd5          0.649                                                 
## qd7          0.743                                                 
## qd12         0.716                                                 
## qd16         0.619                                                 
## qd1                 0.796                                          
## qd10                0.747                                          
## qd20                0.772                                          
## qd27                0.748                                          
## qd3                        0.761                                   
## qd11                       0.750                                   
## qd13                       0.736                                   
## qd30                       0.674                                   
## qd22                              0.835                            
## qd29                              0.813                            
## qd33                              0.779                            
## qd6                                      0.735                     
## qd8          0.352                       0.559                     
## qd18                                     0.589                     
## qd25                                     0.746                     
## qd15                                            0.749              
## qd17                                            0.743              
## qd32                                            0.712              
## qd26                                                   0.616       
## qd28                                                   0.661       
## qd31                                                   0.653       
## qd4                                                           0.414
## qd23                                                          0.474
## 
##                  PA2   PA1   PA5   PA4   PA3   PA7   PA6   PA8   PA9
## SS loadings    4.632 3.667 3.412 3.134 2.783 2.559 2.320 1.882 0.517
## Proportion Var 0.140 0.111 0.103 0.095 0.084 0.078 0.070 0.057 0.016
## Cumulative Var 0.140 0.251 0.355 0.450 0.534 0.612 0.682 0.739 0.755

We notice that the factors extracted by PAF are the same groups we identified by clustering the correlation matrix. In both models, qd8, which was grouped with the features construct, also loads partially onto factor PA1, which corresponds to the performance construct. We fetch its description.

descriptions[name %in% "qd8", ]
##    name                                                         Label
## 1:  qd8 the extra features offered in my smartphone  usually function

Some respondents might consider not only extra features but any feature of their smartphone and evaluate its performance, i.e. how it functions. This might explain why qd8 is partially loaded onto the performance construct. However, it remains mainly loaded onto the features construct, with a value of 0.558 or 0.559 depending on the model.

In model FA9, the loadings of qd4 and qd23 on their shared factor are the weakest of all “main” loadings. This could indicate that keeping qd4 and qd23 might not be entirely appropriate and is open to discussion.

When looking at the communalities of our models, we get the following results:

sort(FA8_423$communality)
##      qd18      qd16      qd26      qd28       qd8      qd31      qd11       qd5 
## 0.6266777 0.6402010 0.6551763 0.6849299 0.6855023 0.7001560 0.7169741 0.7224583 
##      qd33       qd3      qd21      qd30      qd12      qd25       qd2      qd13 
## 0.7612886 0.7629531 0.7706646 0.7756551 0.7829168 0.8004648 0.8052923 0.8058405 
##      qd32      qd29      qd10       qd7      qd20      qd27       qd6       qd9 
## 0.8295319 0.8312862 0.8333518 0.8357790 0.8360730 0.8390860 0.8392665 0.8406325 
##      qd19      qd14       qd1      qd15      qd22      qd17      qd24 
## 0.8435205 0.8526145 0.8538126 0.8560076 0.8572538 0.8573949 0.8826324
sort(FA9$communality)
##       qd4      qd23      qd16      qd26      qd18      qd28       qd8      qd31 
## 0.1860315 0.2740969 0.6409344 0.6570266 0.6693727 0.6805048 0.6954544 0.7034429 
##      qd11       qd5      qd33       qd3      qd21      qd30      qd12      qd25 
## 0.7194675 0.7286179 0.7617402 0.7666282 0.7707646 0.7740474 0.7848395 0.7916454 
##      qd13       qd2      qd32      qd29      qd10      qd20       qd7      qd27 
## 0.8026750 0.8047862 0.8286370 0.8325405 0.8333801 0.8365920 0.8368335 0.8404022 
##       qd9       qd6      qd19      qd14       qd1      qd22      qd17      qd15 
## 0.8405512 0.8421117 0.8439347 0.8525442 0.8531018 0.8557809 0.8564535 0.8588101 
##      qd24 
## 0.8825049

This further indicates that questions 4 and 23 might not be appropriate for factor analysis, as their communalities in FA9 are extremely low. The communality of an item is the sum of its squared factor loadings. It measures the proportion of the item’s variance that can be explained by the underlying factors. If the communality equals 1, then the variable can be fully reproduced from the factors alone.
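For an orthogonal rotation such as varimax, this definition can be verified directly on a fitted model. A minimal sketch on simulated two-factor data (the loading matrix is hypothetical):

```r
library(psych)

set.seed(42)
F2 <- matrix(rnorm(400), 200, 2)                  # two latent factors
Lambda <- rbind(c(0.8, 0.7, 0.6, 0,   0,   0),
                c(0,   0,   0,   0.8, 0.7, 0.6))  # hypothetical loadings
X <- F2 %*% Lambda + matrix(rnorm(1200, sd = 0.5), 200, 6)
colnames(X) <- paste0("v", 1:6)

fit <- fa(X, nfactors = 2, rotate = "varimax", fm = "pa")
# communality = row sums of squared loadings when factors are orthogonal
max(abs(fit$communality - rowSums(fit$loadings^2)))
```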

We observe the scree plots of our two models.

ggplot(mapping = aes(x = 1:length(FA8_423$values),
                     y = FA8_423$values,
                     color = (FA8_423$values >= 1))) +
  geom_point() +
  geom_hline(yintercept = 1, linetype = "dashed") +
  theme_classic() +
  labs(title = "Scree plot of FA8_423",
       x = "Factor Number",
       y = "Eigenvalue",
       color = "Eigenvalue >= 1")

ggplot(mapping = aes(x = 1:length(FA9$values),
                     y = FA9$values,
                     color = (FA9$values >= 1))) +
  geom_point() +
  geom_hline(yintercept = 1, linetype = "dashed") +
  theme_classic() +
  labs(title = "Scree plot of FA9",
       x = "Factor Number",
       y = "Eigenvalue",
       color = "Eigenvalue >= 1")

The eigenvalues are plotted above; the horizontal line represents the significance threshold of 1. If we selected factors solely on the basis of eigenvalues greater than 1, our model would have only 5 factors. However, given all our previous analysis, this does not seem to be the best choice. The elbow rule would probably cut at 8 factors for FA8_423, as there is a bigger gap between 8 and 9. FA9, however, does not have such a gap and the descent is quite smooth, so the cut is more difficult to determine.
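The “eigenvalue > 1” (Kaiser) rule itself is a one-liner on the eigenvalues of the correlation matrix. A small sketch on pure-noise data (hypothetical), which also illustrates why the rule should not be trusted blindly: even uncorrelated variables yield several eigenvalues above 1 through sampling error alone:

```r
set.seed(5)
X  <- matrix(rnorm(2000), nrow = 100, ncol = 20)  # 20 uncorrelated variables
ev <- eigen(cor(X))$values

sum(ev)       # eigenvalues of a correlation matrix always sum to p = 20
sum(ev > 1)   # number of factors the "eigenvalue > 1" rule would retain
```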

1.4 Decide on a final principal factor analysis solution.

After reviewing all the information we gathered about our model, we decide to keep FA8_423 over FA9.

Questions 4 and 23 showed very little correlation with each other, and even though they should theoretically represent the same construct of distinctiveness and prestige, the construct was not explored enough through the survey questions. Every other construct had between 3 and 5 questions, and those questions exhibit high correlations within their group.

Furthermore, the BIC is lowest for 8 factors and starts rising slowly again once we get past 8. The loadings of qd4 and qd23 in the FA9 model remain lower than all other items’ loadings within their constructs, even falling below 0.5. When we looked at the diagonal of the anti-image correlation matrix, the proportions of unique variance of questions 4 and 23 dominated the rest, indicating that the other variables cannot explain most of their variance. Moreover, their communalities were very low in FA9.

These reasons combined made us choose model FA8_423. Its constructs are the ones we listed when we inspected the correlation matrix (except of course the constructs of questions 4 and 23). The loading patterns of FA8_423, as we discussed in the previous point, are clear.

1.5 Run a principal component analysis (PCA) and compare the results.

As with PAF, we run models with the number of components ranging from 1 to 33 and examine some criteria to get an idea of which ones best fit our data.

For PCA, the criteria are:

  • fit, which measures how well the components model reproduces the correlation matrix;
  • fit.off, which measures how well the off-diagonal elements of the correlation matrix are reproduced;
  • objective, which is the value of the function that is minimized by a maximum likelihood procedure;
  • rms, which is the square root of the mean squared off-diagonal residual.

# we initialize an empty data frame which will record our criteria values
fit_df <- matrix(nrow = 33, ncol = 4)
colnames(fit_df) <- c("fit", "fit.off", "objective", "rms")
fit_df <- as.data.frame(fit_df)

# we compute the PCA for nfactors between 1 and 33 and record criteria results in fit_df
for (i in 1:33) {
  PCA <- principal(df1, nfactors = i, rotate = "varimax")
  fit_df[i, 1] <- PCA$fit
  fit_df[i, 2] <- PCA$fit.off
  fit_df[i, 3] <- PCA$objective
  fit_df[i, 4] <- PCA$rms
}

rm(i, PCA)

We plot the results like before.

fit_df %>% gather() %>%                 
  ggplot(aes(x = rep(1:33, ncol(fit_df)), y = value)) +
  facet_wrap(~ key, scales = "free") +
  geom_point() +
  theme_classic() +
  scale_x_continuous(breaks = seq(1, 33, 2)) +
  geom_vline(xintercept = 10, linetype = "dashed") +
  labs(title = "Various criteria for Principal Component Analysis",
       x = "Number of components",
       y = "")

Here, the ideal number of components seems to be 10, indicated by the dashed line. It is a local minimum of the objective function and sits at the elbow of the other three curves.

However, we check whether this result changes when we omit the standalone variables qd4 and qd23.

# we initialize an empty data frame which will record our criteria values
fit_df <- matrix(nrow = 31, ncol = 4)
colnames(fit_df) <- c("fit", "fit.off", "objective", "rms")
fit_df <- as.data.frame(fit_df)

# we compute the PCA for nfactors between 1 and 31 and record criteria results in fit_df
for (i in 1:31) {
  PCA <- principal(df1[, -c(4,23)], nfactors = i, rotate = "varimax")
  fit_df[i, 1] <- PCA$fit
  fit_df[i, 2] <- PCA$fit.off
  fit_df[i, 3] <- PCA$objective
  fit_df[i, 4] <- PCA$rms
}

rm(i, PCA)

Again, we plot the evolution of the criteria.

fit_df %>% gather() %>%                 
  ggplot(aes(x = rep(1:31, ncol(fit_df)), y = value)) +
  facet_wrap(~ key, scales = "free") +
  geom_point() +
  theme_classic() +
  scale_x_continuous(breaks = seq(1, 31, 2)) +
  geom_vline(xintercept = 8, linetype = "dashed") +
  labs(title = "Various criteria for Principal Component Analysis without qd4 and qd23",
       x = "Number of components",
       y = "")

rm(fit_df)

Without questions 4 and 23, the best model seems to be the one with 8 components.

Thus, we will test these three models.

PCA8 <- principal(df1, nfactors=8, rotate = "varimax")
PCA8_423 <- principal(df1[, -c(4, 23)], nfactors = 8, rotate = "varimax")
PCA10 <- principal(df1, nfactors = 10, rotate = "varimax")

We examine the factor loadings, first by plotting them.

par(mfrow = c(2,1))
corrplot(t(PCA8$loadings),
         tl.cex = 0.7,
         title = "PCA with 8 factors including qd4 and qd23 \n Loadings",
         mar = c(1, 1, 3, 0))
corrplot(t(PCA8_423$loadings),
         tl.cex = 0.7,
         title = "PCA with 8 factors excluding qd4 and qd23 \n Loadings",
         mar = c(1, 1, 3, 0))

par(mfrow = c(1,1))
corrplot(t(PCA10$loadings),
         tl.cex = 0.7,
         title = "PCA with 10 factors \n Loadings",
         mar = c(1, 1, 3, 0))

In both PCA8_423 and PCA10, each question is strongly loaded onto one component. In PCA10, questions 4 and 23 each have their own component and their loadings on it are strong, in contrast to what we observed for PAF.

However, PCA8 has some less clear loadings, in particular on component RC7. The two groups we had defined as durability and quality of materials are pushed together onto one component, and the loadings of the durability items are smaller than when they have a component of their own, as in PCA8_423.

We take a look at the numerical values of the loadings, displaying only the ones above 0.3.

print(PCA8$loadings, cutoff = 0.3, sort=TRUE)
## 
## Loadings:
##      RC2    RC1    RC7    RC5    RC4    RC8    RC3    RC6   
## qd9   0.847                                                 
## qd14  0.868                                                 
## qd19  0.851                                                 
## qd21  0.823                                                 
## qd24  0.868                                                 
## qd2          0.758                                          
## qd5          0.700                                          
## qd7          0.755                                          
## qd12         0.742                                          
## qd16         0.703                                          
## qd15                0.768                                   
## qd17                0.740                                   
## qd26                0.597                       0.322       
## qd28         0.339  0.590                                   
## qd31                0.692                                   
## qd32                0.720  0.309                            
## qd1                        0.816                            
## qd10                       0.776                            
## qd20                       0.801                            
## qd27                       0.778                            
## qd3                               0.806                     
## qd11                              0.812                     
## qd13                              0.760                     
## qd30                              0.706                     
## qd6                                      0.752              
## qd8          0.349                       0.620              
## qd18                                     0.700              
## qd25                                     0.791              
## qd22                                            0.849       
## qd29                                            0.836       
## qd33                                            0.838       
## qd4                                                    0.805
## qd23                                                   0.677
## 
##                  RC2   RC1   RC7   RC5   RC4   RC8   RC3   RC6
## SS loadings    4.682 3.973 3.931 3.544 3.306 2.952 2.862 1.240
## Proportion Var 0.142 0.120 0.119 0.107 0.100 0.089 0.087 0.038
## Cumulative Var 0.142 0.262 0.381 0.489 0.589 0.678 0.765 0.803
print(PCA8_423$loadings, cutoff = 0.3, sort = TRUE)
## 
## Loadings:
##      RC2   RC1   RC5   RC4   RC7   RC6   RC3   RC8  
## qd9  0.847                                          
## qd14 0.868                                          
## qd19 0.851                                          
## qd21 0.823                                          
## qd24 0.869                                          
## qd2        0.754                                    
## qd5        0.704                                    
## qd7        0.758                                    
## qd12       0.750                                    
## qd16       0.708                                    
## qd1              0.822                              
## qd10             0.778                              
## qd20             0.803                              
## qd27             0.774                              
## qd3                    0.807                        
## qd11                   0.819                        
## qd13                   0.763                        
## qd30                   0.706                        
## qd6                          0.762                  
## qd8        0.339             0.626                  
## qd18                         0.715                  
## qd25                         0.806                  
## qd22                               0.854            
## qd29                               0.841            
## qd33                               0.842            
## qd15                                     0.787      
## qd17                                     0.780      
## qd32                                     0.765      
## qd26                                           0.727
## qd28                                           0.773
## qd31                                           0.738
## 
##                  RC2   RC1   RC5   RC4   RC7   RC6   RC3   RC8
## SS loadings    4.673 3.908 3.492 3.297 3.020 2.856 2.492 2.337
## Proportion Var 0.151 0.126 0.113 0.106 0.097 0.092 0.080 0.075
## Cumulative Var 0.151 0.277 0.389 0.496 0.593 0.685 0.766 0.841
print(PCA10$loadings, cutoff = 0.3, sort = TRUE)
## 
## Loadings:
##      RC2    RC1    RC6    RC4    RC8    RC3    RC5    RC9    RC10   RC7   
## qd9   0.849                                                               
## qd14  0.869                                                               
## qd19  0.852                                                               
## qd21  0.824                                                               
## qd24  0.870                                                               
## qd2          0.757                                                        
## qd5          0.706                                                        
## qd7          0.761                                                        
## qd12         0.753                                                        
## qd16         0.710                                                        
## qd1                 0.822                                                 
## qd10                0.778                                                 
## qd20                0.803                                                 
## qd27                0.775                                                 
## qd3                        0.805                                          
## qd11                       0.818                                          
## qd13                       0.764                                          
## qd30                       0.707                                          
## qd6                               0.759                                   
## qd8          0.345                0.628                                   
## qd18                              0.698                                   
## qd25                              0.805                                   
## qd22                                     0.854                            
## qd29                                     0.841                            
## qd33                                     0.842                            
## qd15                                            0.785                     
## qd17                                            0.779                     
## qd32                                            0.763                     
## qd26                                                   0.725              
## qd28                                                   0.772              
## qd31                                                   0.735              
## qd23                                                          0.972       
## qd4                                                                  0.989
## 
##                  RC2   RC1   RC6   RC4   RC8   RC3   RC5   RC9  RC10   RC7
## SS loadings    4.714 3.960 3.502 3.300 2.958 2.851 2.476 2.313 1.017 1.011
## Proportion Var 0.143 0.120 0.106 0.100 0.090 0.086 0.075 0.070 0.031 0.031
## Cumulative Var 0.143 0.263 0.369 0.469 0.559 0.645 0.720 0.790 0.821 0.852

Like with PAF, we notice that qd8 loads partially on the component corresponding to performance. However, compared with PAF, the loadings are overall slightly higher. The loadings of questions 4 and 23 are extremely high in PCA10.

PCA8 leaves many variables with low loadings, which does not seem to be the best option, given that the other models produce comparatively much cleaner solutions.

We take a look at the communalities.

sort(PCA8$communality)
##      qd23      qd28      qd26      qd31       qd4      qd16       qd8      qd18 
## 0.5482221 0.6110891 0.6253977 0.6812121 0.6860378 0.7228627 0.7472794 0.7489624 
##       qd5      qd17      qd32      qd11      qd12      qd30      qd15      qd21 
## 0.7758762 0.7983085 0.7991891 0.8061080 0.8124165 0.8130471 0.8143803 0.8251173 
##       qd3       qd2      qd13      qd25       qd7      qd33       qd6      qd10 
## 0.8318214 0.8341482 0.8381954 0.8392376 0.8480460 0.8514121 0.8526109 0.8693608 
##       qd9      qd29      qd19      qd20      qd27      qd14       qd1      qd22 
## 0.8712220 0.8731078 0.8735513 0.8750159 0.8756005 0.8803126 0.8806015 0.8835707 
##      qd24 
## 0.8966449
sort(PCA8_423$communality)
##      qd16      qd18       qd8      qd26       qd5      qd31      qd28      qd30 
## 0.7281547 0.7417442 0.7445806 0.7696596 0.7781026 0.7916366 0.8042188 0.8142034 
##      qd11      qd12      qd21       qd3       qd2      qd13      qd25       qd7 
## 0.8172090 0.8230745 0.8252227 0.8320578 0.8362610 0.8401490 0.8514445 0.8541838 
##      qd33       qd6       qd9      qd10      qd19      qd27      qd20      qd29 
## 0.8570641 0.8581700 0.8717486 0.8732182 0.8739140 0.8771753 0.8792219 0.8799137 
##      qd14      qd32       qd1      qd22      qd24      qd17      qd15 
## 0.8809326 0.8882966 0.8907906 0.8911649 0.8980463 0.9009975 0.9012585
sort(PCA10$communality)
##      qd16      qd18       qd8      qd26       qd5      qd31      qd28      qd30 
## 0.7283694 0.7516579 0.7530489 0.7708933 0.7800535 0.7916076 0.8055633 0.8154692 
##      qd11      qd12      qd21       qd3       qd2      qd13       qd7      qd25 
## 0.8173072 0.8235556 0.8254960 0.8325685 0.8363938 0.8418185 0.8548721 0.8549029 
##      qd33       qd6       qd9      qd10      qd19      qd27      qd29      qd20 
## 0.8571892 0.8608017 0.8717715 0.8735724 0.8741658 0.8785617 0.8801344 0.8802786 
##      qd14      qd32       qd1      qd22      qd24      qd17      qd15      qd23 
## 0.8810422 0.8879448 0.8910183 0.8921397 0.8982019 0.9010692 0.9017948 0.9911283 
##       qd4 
## 0.9965632

Unlike with PAF, in PCA10 the components dedicated to qd4 and qd23 explain most of the variance of those two variables. Both PCA8_423 and PCA10 therefore reach high communalities for all variables, with the lowest values around 0.73.

The communalities for PCA8 are lower for many of the variables. Given this and its unclear loadings, we decide not to pursue this model further.
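The relationship used here can be verified directly: in a principal components solution, a variable's communality is the row sum of its squared loadings. A minimal sketch on synthetic data (not our survey items); the same check applies to PCA8_423$loadings and PCA8_423$communality.

```r
library(psych)

set.seed(42)
toy <- matrix(rnorm(300 * 6), nrow = 300, ncol = 6)  # stand-in for item responses

pc <- principal(toy, nfactors = 2, rotate = "varimax")

# communality of each variable = sum of its squared loadings across components
by_hand <- rowSums(unclass(pc$loadings)^2)
stopifnot(isTRUE(all.equal(unname(pc$communality), unname(by_hand))))
```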

rm(PCA8)

We display the scree plots.

ggplot(mapping = aes(x = 1:length(PCA8_423$values),
                     y = PCA8_423$values,
                     color = (PCA8_423$values >= 1))) +
  geom_point() +
  geom_hline(yintercept = 1, linetype = "dashed") +
  theme_classic() +
  labs(title = "Scree plot of PCA8_423",
       x = "Component Number",
       y = "Eigenvalue",
       color = "Eigenvalue >= 1")

ggplot(mapping = aes(x = 1:length(PCA10$values),
                     y = PCA10$values,
                     color = (PCA10$values >= 1))) +
  geom_point() +
  geom_hline(yintercept = 1, linetype = "dashed") +
  theme_classic() +
  labs(title = "Scree plot of PCA10",
       x = "Component Number",
       y = "Eigenvalue",
       color = "Eigenvalue >= 1")

Again, the rule of retaining eigenvalues above 1 is slightly stricter than what our analysis indicates. However, both models show a small but distinct elbow, which matches the number of components selected for each.
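A common alternative to the fixed eigenvalue > 1 cutoff is parallel analysis, which retains only components whose eigenvalues exceed those of comparable random data; psych provides fa.parallel() for this. A sketch on synthetic noise (on our items the call would be fa.parallel(df1[, -c(4, 23)], fa = "pc")):

```r
library(psych)

set.seed(7)
toy <- matrix(rnorm(300 * 6), nrow = 300, ncol = 6)  # pure noise: no real structure

# retain components whose observed eigenvalues exceed those of random data
pa <- fa.parallel(toy, fa = "pc", plot = FALSE)
pa$ncomp  # suggested number of components
```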

We take a look at the total variance explained by each component in our models.

Let us recall that the eigenvalue of a component represents the amount of the total variance explained (TVE) by that component. We can fetch these numbers directly from our models.

PCA8_423$Vaccounted
##                             RC2       RC1       RC5       RC4       RC7
## SS loadings           4.6725207 3.9075209 3.4924136 3.2967977 3.0197441
## Proportion Var        0.1507265 0.1260491 0.1126585 0.1063483 0.0974111
## Cumulative Var        0.1507265 0.2767755 0.3894340 0.4957824 0.5931935
## Proportion Explained  0.1792036 0.1498638 0.1339433 0.1264409 0.1158152
## Cumulative Proportion 0.1792036 0.3290674 0.4630107 0.5894516 0.7052668
##                              RC6        RC3        RC8
## SS loadings           2.85556858 2.49242392 2.33682587
## Proportion Var        0.09211512 0.08040077 0.07538148
## Cumulative Var        0.68530857 0.76570934 0.84109082
## Proportion Explained  0.10951863 0.09559107 0.08962347
## Cumulative Proportion 0.81478546 0.91037653 1.00000000

Thus, with 8 factors the percentage of the total variance which is explained by the model is 84.11%.

PCA10$Vaccounted
##                             RC2       RC1       RC6        RC4        RC8
## SS loadings           4.7136511 3.9601108 3.5020257 3.29975846 2.95816184
## Proportion Var        0.1428379 0.1200034 0.1061220 0.09999268 0.08964127
## Cumulative Var        0.1428379 0.2628413 0.3689633 0.46895594 0.55859721
## Proportion Explained  0.1677399 0.1409244 0.1246230 0.11742513 0.10526908
## Cumulative Proportion 0.1677399 0.3086643 0.4332873 0.55071245 0.65598154
##                             RC3        RC5        RC9       RC10        RC7
## SS loadings           2.8508701 2.47567244 2.31255667 1.01696953 1.01117878
## Proportion Var        0.0863900 0.07502038 0.07007747 0.03081726 0.03064178
## Cumulative Var        0.6449872 0.72000759 0.79008506 0.82090232 0.85154410
## Proportion Explained  0.1014510 0.08809923 0.08229459 0.03618986 0.03598379
## Cumulative Proportion 0.7574325 0.84553176 0.92782636 0.96401621 1.00000000

Thus, with 10 factors the percentage of the total variance which is explained by the model is 85.15%.

Both results are quite close and our factors explain most of the variance in the data.
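The "SS loadings" row of Vaccounted can be recomputed by hand: each entry is the column sum of squared loadings of that component, and dividing by the number of variables gives "Proportion Var". A quick sketch on synthetic data:

```r
library(psych)

set.seed(1)
toy <- matrix(rnorm(300 * 6), nrow = 300, ncol = 6)
pc <- principal(toy, nfactors = 2, rotate = "varimax")

# eigenvalue (SS loadings) per component = column sums of squared loadings
ss <- colSums(unclass(pc$loadings)^2)
stopifnot(isTRUE(all.equal(unname(ss), unname(pc$Vaccounted["SS loadings", ]))))

# proportion of total variance explained = SS loadings / number of variables
stopifnot(isTRUE(all.equal(unname(ss / 6),
                           unname(pc$Vaccounted["Proportion Var", ]))))
```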

To be coherent with the best model we chose for PAF, we will select PCA8_423. PCA10 works well from the point of view of principal component analysis, whose goal is to produce an empirical summary of the data set. However, PCA components do not necessarily have a causal interpretation, unlike PAF, which aims to find the underlying constructs of the data, called factors. Since our PAF analysis showed that qd4 and qd23 do not fit this factor modelling because of their standalone nature, to preserve the most meaningful interpretation of the data set we keep FA8_423 and PCA8_423.

rm(FA9, PCA10)

1.6 What do the eigenvalues of the factors (quality dimensions) tell us about their relevance for repurchase behaviour?

As we saw previously, both FA8_423 and PCA8_423 showed clear loading patterns. We now look at the TVE tables again to see how the eigenvalues (the sums of squared loadings per factor or component) quantify the share of variance each factor or component explains.

TVE_PAF <- FA8_423$Vaccounted
TVE_PCA <- PCA8_423$Vaccounted

# we rename each factor or component to designate the quality dimensions we have established
colnames(TVE_PAF) <- c("Serviceability", "Performance", "Aesthetics", "Ease of use",
                       "Flawlessness", "Features", "Quality of materials", "Durability")
colnames(TVE_PCA) <- c("Serviceability", "Performance", "Aesthetics", "Ease of use",
                       "Features", "Flawlessness", "Quality of materials", "Durability")

TVE_PAF %>% kable(digits = 3) %>% kable_styling(full_width = TRUE)
Serviceability Performance Aesthetics Ease of use Flawlessness Features Quality of materials Durability
SS loadings 4.569 3.577 3.404 3.121 2.781 2.675 2.324 1.935
Proportion Var 0.147 0.115 0.110 0.101 0.090 0.086 0.075 0.062
Cumulative Var 0.147 0.263 0.373 0.473 0.563 0.649 0.724 0.787
Proportion Explained 0.187 0.147 0.140 0.128 0.114 0.110 0.095 0.079
Cumulative Proportion 0.187 0.334 0.474 0.602 0.716 0.825 0.921 1.000
TVE_PCA %>% kable(digits = 3) %>% kable_styling(full_width = TRUE)
Serviceability Performance Aesthetics Ease of use Features Flawlessness Quality of materials Durability
SS loadings 4.673 3.908 3.492 3.297 3.020 2.856 2.492 2.337
Proportion Var 0.151 0.126 0.113 0.106 0.097 0.092 0.080 0.075
Cumulative Var 0.151 0.277 0.389 0.496 0.593 0.685 0.766 0.841
Proportion Explained 0.179 0.150 0.134 0.126 0.116 0.110 0.096 0.090
Cumulative Proportion 0.179 0.329 0.463 0.589 0.705 0.815 0.910 1.000

In both PAF and PCA, the three factors or components explaining the largest proportions of variance are Serviceability, Performance and Aesthetics. The only change in ordering is that Features and Flawlessness switch places between the two models. To better visualize how the explained variance is distributed among factors, we display two Pareto charts.

ggplot(mapping = aes(x = reorder(colnames(TVE_PAF), -TVE_PAF[4,]), y = TVE_PAF[4,])) +
  geom_col() +
  geom_line(aes(x = 1:8, y = TVE_PAF[5,]), color = "red") +
  geom_point(aes(x = 1:8, y = TVE_PAF[5,])) +
  theme_classic() +
  labs(x = "Factor",
       y = "Proportion of TVE",
       title = "Pareto chart of TVE for FA8_423")

ggplot(mapping = aes(x = reorder(colnames(TVE_PCA), -TVE_PCA[4,]), y = TVE_PCA[4,])) +
  geom_col() +
  geom_line(aes(x = 1:8, y = TVE_PCA[5,]), color = "red") +
  geom_point(aes(x = 1:8, y = TVE_PCA[5,])) +
  theme_classic() +
  labs(x = "Component",
       y = "Proportion of TVE",
       title = "Pareto chart of TVE for PCA8_423")

rm(TVE_PAF, TVE_PCA)

In both cases, Serviceability is the most important dimension, with the biggest gap to the other dimensions. The Pareto curve is smooth and flat, meaning the importance of the dimensions decreases only slowly. This is also visible in the minimal gaps between the bars once we pass Serviceability. So even though some dimensions matter more than others, every dimension contributes its share of the TVE; none appears superfluous or unimportant.

2 Oblique Factor Analysis

2.1 Run an oblique factor analysis with the variables of your final solution in exercise 1 using Promax.

We run the OFA with principal axis factoring as before. What changes is the rotation method, which is no longer orthogonal but allows correlations between factors (hence the term oblique).

OFA8_423 <- fa(df1[, -c(4, 23)], nfactors = 8, rotate = "promax", fm = "pa")
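One caveat with oblique rotations: the loadings reported by fa() are the pattern matrix (partial regression weights of items on factors), not the structure matrix (item-factor correlations). The two are linked through the factor correlation matrix Phi via structure = pattern %*% Phi, as this sketch on synthetic data illustrates:

```r
library(psych)

set.seed(2)
toy <- matrix(rnorm(400 * 8), nrow = 400, ncol = 8)
obl <- fa(toy, nfactors = 2, rotate = "promax", fm = "pa")

# for oblique solutions: structure matrix = pattern matrix %*% Phi
S <- unclass(obl$loadings) %*% obl$Phi
stopifnot(isTRUE(all.equal(S, unclass(obl$Structure), check.attributes = FALSE)))
```

The same identity holds for OFA8_423$loadings, OFA8_423$Phi and OFA8_423$Structure.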

2.2 How high are the factors correlated and what is the highest correlation between factors?

We display a correlation plot of the Phi matrix.

cor_fac <- OFA8_423$Phi

# changing to names of constructs for interpretability

colnames(cor_fac) <- c("Serviceability", "Performance", "Aesthetics", "Ease of use",
                       "Features", "Quality of materials", "Flawlessness", "Durability")

rownames(cor_fac) <- c("Serviceability", "Performance", "Aesthetics", "Ease of use",
                       "Features", "Quality of materials", "Flawlessness", "Durability")

corrplot(cor_fac)

The highest correlation is between Performance and Features, at 0.72. Both factors are closely related: the performance of a smartphone can be linked to the performance of its features, and conversely one can speak of the performance of a particular extra feature. Given this overlap between the two topics (which we also saw in past models, with qd8 partially loading on the Performance construct), it is not surprising that they are quite strongly correlated.

The lowest correlation, 0.40, is between Serviceability and Flawlessness. This also seems logical: customer service is quite removed from the defects or glitches of a smartphone, so we would not expect the two to be strongly correlated.
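Rather than reading these values off the plot, the extreme off-diagonal entries can be located programmatically. A small helper, sketched on a toy correlation matrix; on our model it would be applied to cor_fac (or OFA8_423$Phi):

```r
# index of the largest off-diagonal entry of a correlation matrix
max_offdiag <- function(Phi) {
  off <- Phi
  diag(off) <- NA                     # ignore the 1s on the diagonal
  which(off == max(off, na.rm = TRUE), arr.ind = TRUE)[1, ]
}

# toy matrix mimicking three of our factor correlations
Phi <- matrix(c(1.00, 0.72, 0.40,
                0.72, 1.00, 0.55,
                0.40, 0.55, 1.00), nrow = 3,
              dimnames = rep(list(c("Performance", "Features", "Serviceability")), 2))
max_offdiag(Phi)  # points at the 0.72 Performance/Features pair
```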

We display a boxplot of the values of the correlations between factors (except for the diagonal values which are always 1).

cor_fac <- cor_fac * upper.tri(cor_fac)
cor_fac <- c(cor_fac)
cor_fac <- cor_fac[cor_fac != 0]

ggplot(mapping = aes(y = cor_fac)) +
  geom_boxplot(width = 0.6) +
  theme_classic() +
  scale_x_continuous(limits = c(-0.5,0.5)) +
  theme(axis.ticks.x = element_blank(), axis.text.x = element_blank()) +
  labs(y = "Correlation values",
       title = "Boxplot of correlations between factors for OFA8_423")

rm(cor_fac)

Correlations between distinct factors range, as we saw, between 0.4 and 0.7. The median is around 0.55, with half the values roughly between 0.51 and 0.61. Correlations between factors are therefore quite strong, and the orthogonal factor analysis of exercise 1 may not have been the best model to describe the factors, as it ignores these substantial correlations.

2.3 Compute factor scores for your factors and name and label them appropriately.

We can fetch the factor scores directly inside our model. We rename the columns as we did with the correlation matrix before and display the first six rows of our factor scores.

factors <- as.data.frame(OFA8_423$scores)

colnames(factors) <- c("Serviceability", "Performance", "Aesthetics", "Ease of use",
                       "Features", "Quality of materials", "Flawlessness", "Durability")

head(factors) %>% kable(digits = 2) %>% kable_styling(full_width = TRUE)
Serviceability Performance Aesthetics Ease of use Features Quality of materials Flawlessness Durability
-1.09 -0.59 -1.42 -0.71 -0.94 -1.11 -1.04 -0.27
0.14 -0.20 -0.10 -0.62 0.26 0.05 -1.04 -0.84
-1.09 -0.54 -1.41 -0.83 -0.82 -1.11 -0.20 0.17
0.34 1.08 1.22 0.98 1.34 0.96 1.17 0.89
-1.06 -0.49 -0.16 0.50 -0.92 0.09 0.01 0.29
-1.10 -0.49 -1.46 -1.37 -0.92 -0.75 -0.04 -0.52

2.4 Compute mean scores for the three willingness to pay premium items wtp1, wtp2, wtp3 and the two repurchase intention items ri1, ri2.

We create a dataframe selecting all columns which will be useful for the rest of this exercise. We remove NA values from the dataframe.

dfobl <- df[, c(9:41, 43,44,45, 51,52, 74)]
dfobl <- na.omit(dfobl)

We use factor analysis to construct two factors from the five variables mentioned. For reference, these are the questions each variable corresponds to:

wtp1: I am likely to pay a little bit more for using my current smartphone’s brand
wtp2: If my current smartphone’s brand were to raise the price by 10%, I would likely remain with it
wtp3: I am willing to pay more for my current smartphone’s brand
ri1: I intend to buy my current smartphone’s brand in my next purchase
ri2: If I want to buy a smartphone again, I will buy my current brand

y_factor_model <- fa(dfobl[, 34:38], nfactors = 2, rotate = "promax", fm = "pa")
y_factor_model$loadings
## 
## Loadings:
##      PA1    PA2   
## wtp1  0.854       
## wtp2  0.530  0.331
## wtp3  0.969       
## ri1          0.941
## ri2          0.885
## 
##                 PA1   PA2
## SS loadings    1.95 1.781
## Proportion Var 0.39 0.356
## Cumulative Var 0.39 0.746
y_factor_model$Phi
##           PA1       PA2
## PA1 1.0000000 0.7371038
## PA2 0.7371038 1.0000000

Looking at the loadings, we see that wtp2 loads on both factors. Re-reading the question makes it clear why: the respondent is asked about a price raise and the likelihood of sticking with the product, so willingness to pay a premium and repurchase intention intersect in this item. The imprecise wording of wtp2 leaves room to interpret repurchase implications into it.

We also look at the correlation matrix of the factors, which, this being an oblique factor analysis, are allowed to be correlated: the correlation is 0.737. Again, this is quite intuitive. Willingness to pay a premium speaks to an intention of continuity: a user who is willing to pay for extra services is interested in sticking with the chosen brand for an extended period of time.

# Computing mean scores
dfreg <- cbind(factors, y_factor_model$scores, df1$qd4, df1$qd23)
colnames(dfreg)[9:12] <- c("wtp", "ri", "qd4", "qd23")
dfreg <- scale(dfreg)
dfreg <- as.data.frame(dfreg)
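Note that the exercise wording literally asks for mean scores, which could be computed with rowMeans over the raw items (in our data, columns 34:36 of dfobl for wtp and 37:38 for ri). We use the factor scores instead, since the two-factor solution above supports them; the mean-score version would look like this toy sketch:

```r
# toy stand-in for the wtp and ri items of dfobl
toy <- data.frame(wtp1 = c(5, 3), wtp2 = c(4, 2), wtp3 = c(3, 1),
                  ri1  = c(5, 2), ri2  = c(4, 2))

# mean score per respondent for each construct
wtp_mean <- rowMeans(toy[, c("wtp1", "wtp2", "wtp3")])
ri_mean  <- rowMeans(toy[, c("ri1", "ri2")])

wtp_mean  # 4 and 2
ri_mean   # 4.5 and 2
```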

2.5 Run a regression analysis with the factor scores of the quality dimensions as independent variables and both the mean score of willingness to pay premium and repurchase intention as dependent variables.

2.5.1 Willingness to pay for premium items regression

In the first model we do not consider the standalone items qd4 and qd23 as predictors (nor, of course, the repurchase intention score ri); we then refit a second model that adds qd4 and qd23 back in.

lmreg <- lm(wtp ~. , data = dfreg[, -c(10,11,12)])
summary(lmreg)
## 
## Call:
## lm(formula = wtp ~ ., data = dfreg[, -c(10, 11, 12)])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3.03969 -0.48095  0.07384  0.61833  1.93388 
## 
## Coefficients:
##                                      Estimate             Std. Error t value
## (Intercept)             0.0000000000000001677  0.0256051125114434251   0.000
## Serviceability          0.1084876242808307384  0.0347805486970546793   3.119
## Performance             0.1235641083793122125  0.0483914731780226426   2.553
## Aesthetics              0.0777680081993370326  0.0411064953037280148   1.892
## `Ease of use`           0.1837353679503051229  0.0397588357996485689   4.621
## Features                0.1388633189439077154  0.0440492847848557786   3.152
## `Quality of materials`  0.0996281742106530088  0.0420028528111159663   2.372
## Flawlessness           -0.0373575534593991038  0.0366357786705613017  -1.020
## Durability              0.0229922469370993400  0.0444428483763980794   0.517
##                          Pr(>|t|)    
## (Intercept)               1.00000    
## Serviceability            0.00187 ** 
## Performance               0.01082 *  
## Aesthetics                0.05881 .  
## `Ease of use`          0.00000433 ***
## Features                  0.00167 ** 
## `Quality of materials`    0.01789 *  
## Flawlessness              0.30813    
## Durability                0.60504    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7971 on 960 degrees of freedom
## Multiple R-squared:   0.37,  Adjusted R-squared:  0.3647 
## F-statistic: 70.46 on 8 and 960 DF,  p-value: < 0.00000000000000022
lmreg <- lm(wtp ~. , data = dfreg[, -10])
summary(lmreg)
## 
## Call:
## lm(formula = wtp ~ ., data = dfreg[, -10])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.93299 -0.49530  0.05103  0.56617  1.94723 
## 
## Coefficients:
##                                      Estimate             Std. Error t value
## (Intercept)             0.0000000000000002183  0.0251334931373083381   0.000
## Serviceability          0.1093773824378367204  0.0341476919008578683   3.203
## Performance             0.1170862046538173468  0.0475176876958934846   2.464
## Aesthetics              0.0645188059910867112  0.0404179843546417644   1.596
## `Ease of use`           0.1860196996983554674  0.0391696140168363849   4.749
## Features                0.1125164023430060556  0.0434598099201612131   2.589
## `Quality of materials`  0.0800595846253271126  0.0413507024275512683   1.936
## Flawlessness           -0.0195620081434142966  0.0360771398861684939  -0.542
## Durability              0.0369809331208005160  0.0438036551636300855   0.844
## qd4                     0.0318260303574137565  0.0259318681609654669   1.227
## qd23                    0.1519932068704890060  0.0265341436120756745   5.728
##                            Pr(>|t|)    
## (Intercept)                 1.00000    
## Serviceability              0.00140 ** 
## Performance                 0.01391 *  
## Aesthetics                  0.11075    
## `Ease of use`          0.0000023555 ***
## Features                    0.00977 ** 
## `Quality of materials`      0.05315 .  
## Flawlessness                0.58779    
## Durability                  0.39874    
## qd4                         0.22001    
## qd23                   0.0000000136 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7824 on 958 degrees of freedom
## Multiple R-squared:  0.3942, Adjusted R-squared:  0.3879 
## F-statistic: 62.34 on 10 and 958 DF,  p-value: < 0.00000000000000022
2.5.2 Repurchase intention regression

lmreg <- lm(ri ~. , data = dfreg[, -c(9,11,12)])
summary(lmreg)
## 
## Call:
## lm(formula = ri ~ ., data = dfreg[, -c(9, 11, 12)])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8802 -0.4482  0.1294  0.5058  2.3078 
## 
## Coefficients:
##                                      Estimate             Std. Error t value
## (Intercept)            0.00000000000000007059 0.02410019026951355781   0.000
## Serviceability         0.08242737399210432747 0.03273634673163351716   2.518
## Performance            0.23462766571749182898 0.04554729882523227830   5.151
## Aesthetics             0.05407319770365901640 0.03869049033430174700   1.398
## `Ease of use`          0.21661716024210228571 0.03742203855724677425   5.788
## Features               0.05493024659332951254 0.04146031946067664509   1.325
## `Quality of materials` 0.05244392709696910765 0.03953416506792777757   1.327
## Flawlessness           0.01353651262562929544 0.03448253688546439044   0.393
## Durability             0.07910551715216894431 0.04183075163257525542   1.891
##                             Pr(>|t|)    
## (Intercept)                   1.0000    
## Serviceability                0.0120 *  
## Performance            0.00000031392 ***
## Aesthetics                    0.1626    
## `Ease of use`          0.00000000961 ***
## Features                      0.1855    
## `Quality of materials`        0.1850    
## Flawlessness                  0.6947    
## Durability                    0.0589 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7502 on 960 degrees of freedom
## Multiple R-squared:  0.4418, Adjusted R-squared:  0.4372 
## F-statistic: 94.99 on 8 and 960 DF,  p-value: < 0.00000000000000022
lmreg <- lm(ri ~. , data = dfreg[, -9])
summary(lmreg)
## 
## Call:
## lm(formula = ri ~ ., data = dfreg[, -9])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8769 -0.4056  0.1215  0.4963  2.4138 
## 
## Coefficients:
##                                     Estimate            Std. Error t value
## (Intercept)            0.0000000000000001066 0.0238435065776436139   0.000
## Serviceability         0.0821377618657198327 0.0323950479943771394   2.536
## Performance            0.2291171109720118726 0.0450788234226636511   5.083
## Aesthetics             0.0435239679610242361 0.0383435151870897203   1.135
## `Ease of use`          0.2224735732750877104 0.0371592179547796733   5.987
## Features               0.0378231217820571078 0.0412292178422390201   0.917
## `Quality of materials` 0.0384293563409503283 0.0392283611328963014   0.980
## Flawlessness           0.0255842665155526089 0.0342254662922891473   0.748
## Durability             0.0930778523147515052 0.0415554150914455717   2.240
## qd4                    0.0548441609991407522 0.0246009046847828830   2.229
## qd23                   0.0936607704731830248 0.0251722681081497229   3.721
##                             Pr(>|t|)    
## (Intercept)                  1.00000    
## Serviceability               0.01139 *  
## Performance            0.00000044762 ***
## Aesthetics                   0.25661    
## `Ease of use`          0.00000000302 ***
## Features                     0.35917    
## `Quality of materials`       0.32752    
## Flawlessness                 0.45493    
## Durability                   0.02533 *  
## qd4                          0.02602 *  
## qd23                         0.00021 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7422 on 958 degrees of freedom
## Multiple R-squared:  0.4548, Adjusted R-squared:  0.4491 
## F-statistic: 79.92 on 10 and 958 DF,  p-value: < 0.00000000000000022

2.6 Do the results of the regression analysis for repurchase intentions differ across brands?

dfreg$brandrec <- as.factor(dfobl$brandrec)
levels(dfreg$brandrec) <- c("Apple", "Samsung", "LG", "Motorola", "Other")

ggplot(data = dfreg, aes(brandrec)) +
  geom_bar() +
  theme_classic() +
  labs(title = "Repartition of brands in dataset",
       x = "Brand",
       y = "Count")

summary(dfreg$brandrec)
##    Apple  Samsung       LG Motorola    Other 
##      534      234       63       68       70
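Instead of refitting one brand at a time, the per-brand regressions can be produced in one pass with split() and lapply(); on our data the grouping variable would be dfreg$brandrec. A sketch on toy data:

```r
set.seed(3)
toy <- data.frame(y = rnorm(60),
                  x = rnorm(60),
                  g = rep(c("A", "B", "C"), each = 20))

# fit one regression per group, then collect a comparison statistic
fits <- lapply(split(toy, toy$g), function(d) lm(y ~ x, data = d))
sapply(fits, function(m) summary(m)$adj.r.squared)
```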
lmreg <- lm(wtp ~. , data = dfreg[dfreg$brandrec == "Apple", -c(10,13)])
summary(lmreg)
## 
## Call:
## lm(formula = wtp ~ ., data = dfreg[dfreg$brandrec == "Apple", 
##     -c(10, 13)])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3.09117 -0.36419  0.06468  0.51468  1.55683 
## 
## Coefficients:
##                        Estimate Std. Error t value          Pr(>|t|)    
## (Intercept)             0.23695    0.03163   7.492 0.000000000000291 ***
## Serviceability          0.12390    0.04381   2.828           0.00486 ** 
## Performance             0.03104    0.06392   0.486           0.62745    
## Aesthetics              0.14958    0.05106   2.929           0.00354 ** 
## `Ease of use`           0.20919    0.04912   4.259 0.000024394234904 ***
## Features                0.05796    0.05348   1.084           0.27896    
## `Quality of materials`  0.08436    0.05228   1.614           0.10720    
## Flawlessness           -0.04971    0.04431  -1.122           0.26248    
## Durability              0.03166    0.05368   0.590           0.55564    
## qd4                     0.01536    0.03193   0.481           0.63073    
## qd23                    0.07032    0.03401   2.067           0.03920 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7065 on 523 degrees of freedom
## Multiple R-squared:  0.3592, Adjusted R-squared:  0.3469 
## F-statistic: 29.31 on 10 and 523 DF,  p-value: < 0.00000000000000022
lmreg <- lm(ri ~. , data = dfreg[dfreg$brandrec == "Motorola", -c(9,13)])
summary(lmreg)
## 
## Call:
## lm(formula = ri ~ ., data = dfreg[dfreg$brandrec == "Motorola", 
##     -c(9, 13)])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.5591 -0.3686  0.0016  0.3651  2.2353 
## 
## Coefficients:
##                        Estimate Std. Error t value  Pr(>|t|)    
## (Intercept)            -0.41351    0.09675  -4.274 0.0000738 ***
## Serviceability         -0.02048    0.12346  -0.166  0.868837    
## Performance             0.29686    0.15383   1.930  0.058618 .  
## Aesthetics              0.13237    0.12427   1.065  0.291280    
## `Ease of use`           0.45254    0.11961   3.783  0.000373 ***
## Features               -0.16613    0.16445  -1.010  0.316664    
## `Quality of materials` -0.13988    0.13441  -1.041  0.302392    
## Flawlessness            0.06691    0.14153   0.473  0.638161    
## Durability              0.11952    0.14783   0.808  0.422179    
## qd4                     0.11227    0.10625   1.057  0.295134    
## qd23                    0.01087    0.09431   0.115  0.908618    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6805 on 57 degrees of freedom
## Multiple R-squared:  0.6125, Adjusted R-squared:  0.5445 
## F-statistic:  9.01 on 10 and 57 DF,  p-value: 0.00000001086
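The two regressions above fit each model one subsample at a time; the same idea can be written as a loop over all brand levels, which makes it easy to compare fits across every brand at once. A minimal, self-contained sketch using simulated data (dfreg itself is not reproduced here, so the variable names ri, ease and perf are placeholders):

```r
# Sketch: fit the same regression separately per brand and compare R-squared.
# Simulated stand-in data; in the report this would be dfreg and its factor scores.
set.seed(42)
sim <- data.frame(
  ri    = rnorm(300),
  ease  = rnorm(300),
  perf  = rnorm(300),
  brand = factor(sample(c("Apple", "Samsung", "Motorola"), 300, replace = TRUE))
)

# One lm() per brand level
fits <- lapply(levels(sim$brand), function(b) {
  lm(ri ~ ease + perf, data = sim[sim$brand == b, ])
})
names(fits) <- levels(sim$brand)

# Compare model fit across brands
sapply(fits, function(m) summary(m)$r.squared)
```

The same pattern extends to extracting and tabulating coefficients per brand with `sapply(fits, coef)`.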

2.7 What do the results of these regressions imply?

For Apple owners, willingness to pay is driven mainly by Serviceability, Aesthetics and Ease of use (with a smaller contribution from the standalone item qd23), and the model explains about 35% of the variance (adjusted R-squared 0.347). For Motorola owners, repurchase intention is explained almost entirely by Ease of use, with Performance only marginally significant, yet the model fits much better (adjusted R-squared 0.545). Bearing in mind that the two models use different dependent variables, the results imply that the drivers of customer outcomes differ across brands: Ease of use matters in both cases, but the remaining factors play different roles, so pooling all brands into a single regression would mask these differences.
2.8 Please compare the brands on the factor scores. What are the points of parity and points of difference for the different brands?

ggplot(data = melt(dfreg[, c(1:8, 13)]), aes(y = value, color = brandrec)) +
  facet_wrap(~ variable, nrow = 2) +
  geom_boxplot() +
  theme_classic() +
  theme(axis.title.x=element_blank(),
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank()) +
  labs(title = "Boxplots of factor scores per brand",
       y = "Factor score",
       color = "Brand")
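Alongside the boxplots, a table of mean factor scores per brand makes points of parity (similar means) and points of difference (diverging means) easier to read off. A minimal sketch with simulated stand-in data; in the report, the columns would be the eight factor scores from dfreg:

```r
# Sketch: mean factor score per brand, as a numeric complement to the boxplots.
# Simulated data with hypothetical column names standing in for dfreg.
set.seed(1)
sim <- data.frame(
  Serviceability = rnorm(100),
  Performance    = rnorm(100),
  brandrec       = factor(sample(c("Apple", "Samsung"), 100, replace = TRUE))
)

# One row per brand, one column per factor
brand_means <- aggregate(cbind(Serviceability, Performance) ~ brandrec,
                         data = sim, FUN = mean)
brand_means
```

Factors whose means are close across brands are candidate points of parity; factors with clearly separated means are candidate points of difference.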